Mining Protein Structure Data
Abstract
This paper describes the application of machine learning algorithms to knowledge discovery in a protein structure database. The problem addressed is predicting the solvent exposure of each amino acid residue, using different levels of exposed surface to define exposure. First, we introduce a baseline classifier that achieves good prediction results despite taking only the amino acid type into account. Then we explain how we gathered and processed the data and built our classifiers to improve on the baseline prediction. Finally, we test and compare several classifiers (e.g. neural networks, C5.0, CART and CHAID) and parameters (level of information per amino acid, SCOP class of the protein, sliding window around the current amino acid) that might influence prediction accuracy. We conclude by showing that our models achieve a modest but statistically significant improvement over the baseline classifier's accuracy.
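The abstract does not say how the baseline is built beyond its use of the amino acid type alone. As a minimal sketch, assuming the baseline is a per-residue-type majority vote, it could look like the following. The function names, the "buried"/"exposed" labels, and the toy sequence are all hypothetical and chosen only for illustration; the sliding-window helper shows one plausible way to collect neighbouring residues as extra features.

```python
from collections import Counter, defaultdict


def fit_baseline(residues, labels):
    """Learn the most frequent exposure label for each amino acid type."""
    counts = defaultdict(Counter)
    for aa, label in zip(residues, labels):
        counts[aa][label] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def predict_baseline(model, residues, default="buried"):
    """Predict exposure from amino acid type only (default for unseen types)."""
    return [model.get(aa, default) for aa in residues]


def window_features(sequence, i, half_width, pad="-"):
    """Return the residues from i - half_width to i + half_width,
    padding with `pad` where the window runs past either chain end."""
    padded = pad * half_width + sequence + pad * half_width
    return padded[i : i + 2 * half_width + 1]


# Toy, made-up training data (not taken from the paper).
train_seq = "LKLKLAKA"
train_lab = ["buried", "exposed", "buried", "exposed",
             "buried", "buried", "exposed", "buried"]

model = fit_baseline(train_seq, train_lab)
predictions = predict_baseline(model, "KLA")
first_window = window_features("LKLA", 0, 1)
last_window = window_features("LKLA", 3, 1)
```

A classifier that improves on this baseline would take the window string (or an encoding of it) as input instead of the single residue type.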
Similar resources
Consensus Data Mining (CDM) Protein Secondary Structure Prediction Server: Combining GOR V and Fragment Database Mining (FDM)
One of the challenges in protein secondary structure prediction is to overcome the cross-validated 80% prediction accuracy barrier. Here, we propose a novel approach to surpass this barrier. Instead of using a single algorithm that relies on a limited data set for training, we combine two complementary methods having different strengths: Fragment Database Mining (FDM) and GOR V. FDM harnesses t...
Extraction of Substructures of Proteins Essential to their Biological Functions by a Data Mining Technique
Correlation between the sequential, structural, and functional features of proteins is one of the most important open questions in the field of molecular biology. To this problem, we apply a technique known as data mining for discovering associations across protein sequence, structure, and function. We were able to find various association rules on the substructures essential to some protein fu...
Data Mining in Proteomics with Learning Classifier Systems
The era of data mining has renewed research effort in certain areas of biology that, because of their difficulty and the lack of knowledge about them, were and still are considered unsolved problems. One such problem, which is one of the fundamental open problems in computational biology, is the prediction of the 3D structure of proteins, or protein structure prediction (PSP). The human experts, with th...
A New Approach to Protein Structure Mining and Alignment
One of the largest areas of bioinformatic and data mining research has been in the protein domain. These efforts have included protein structure prediction, folding pathway prediction, sequence alignment, ab initio simulation, structure alignment, substructure detection and many others. Substructure detection is generally defined as the mining of a molecule’s 3D structure in order to find inter...
Accuracy evaluation of different statistical and geostatistical censored data imputation approaches (Case study: Sari Gunay gold deposit)
Most of the geochemical datasets include missing data with different portions and this may cause a significant problem in geostatistical modeling or multivariate analysis of the data. Therefore, it is common to impute the missing data in most of geochemical studies. In this study, three approaches called half detection (HD), multiple imputation (MI), and the cosimulation based on Markov model 2...
Data Mining for Identification of Forkhead Box O (FOXO3a) in Different Organisms Using Nucleotide and Tandem Repeat Sequences
Background: Deregulation of the FOXO3a gene, which belongs to the Forkhead box O (FOXO) transcription factors, can cause cancer (e.g. breast cancer). FOXO factors have an important role in ubiquitination, acetylation, de-acetylation, protein-protein interactions and phosphorylation. Understanding the regulation and mechanisms of FOXO3a can lead to cancer treatment. The aim of this study recent association...